Aligning Noisy Parallel Corpora Across Language Groups : Word Pair Feature Matching by Dynamic Time Warping

نویسندگان

  • Pascale Fung
  • Kathleen McKeown
چکیده

We propose a new algorithm, DK-vec, for aligning pairs of Asian/Indo-European noisy parallel texts without sentence boundaries. The algorithm uses frequency, position and recency information as features for pattern matching. Dynamic Time Warping is used as the matching technique between word pairs. This algorithm produces a small bilingual lexicon which provides anchor points for alignment.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

DOMAIN WORD TRANSLATION BY SPACE-FREQUENCY ANALYSIS OF CONTEXT LENGTH HISTOGRAMS - Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE Inte

We report a new statistical feature relating a bilingual word pair in a non-parallel English-Chinese corpus. It is found that the lengths of context segments of a word is closely correlated to that of its translation, even when the corpus is non-parallel, i.e., monolingual texts which are not translations of each other. The context segment length histogram of a word has a characteristic pattern...

متن کامل

Domain word translation by space-frequency analysis of context length histograms

We report a new statistical feature relating a bilingual word pair in a non-parallel English-Chinese corpus. It is found that the lengths of context segments of a word is closely correlated to that of its translation, even when the corpus is non-parallel, i.e., monolingual texts which are not translations of each other. The context segment length histogram of a word has a characteristic pattern...

متن کامل

The Impact of Lemmatization in Word Alignment

The focus of this thesis is on examining whether word alignment results can be improved in precision and recall through lemmatization, and extraction of lemma dictionaries from the resulting links. Lemmas are extracted from existing lexical resources in order to replace word forms in two parallel corpora documents, one featuring the language pair English-Swedish and the other the language pair ...

متن کامل

Multi-Dimensional Dynamic Time Warping for Gesture Recognition

We present an algorithm for Dynamic Time Warping (DTW) on multi-dimensional time series (MDDTW). The algorithm utilises all dimensions to find the best synchronisation. It is compared to ordinary DTW, where a single dimension is used for aligning the series. Both one-dimensional and multidimensional DTW are also tested when derivatives instead of feature values are used for calculating the warp...

متن کامل

Word Image Matching Using Dynamic Time Warping

Libraries and other institutions are interested in providing access to scanned versions of their large collections of handwritten historical manuscripts on electronic media. Convenient access to a collection requires an index, which is manually created at great labour and expense. Since current handwriting recognizers do not perform well on historical documents, a technique called word spotting...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • CoRR

دوره abs/cmp-lg/9409011  شماره 

صفحات  -

تاریخ انتشار 1994